Biodiversity Heritage Library
http://www.biodiversitylibrary.org
The University of Kansas science bulletin.
[Lawrence] : University of Kansas, 1902-1996.
http://www.biodiversitylibrary.org/bibliography/3179
38, pt. 2: http://www.biodiversitylibrary.org/item/23745
Page(s): Page 1409, Page 1410, Page 1411, Page 1412, Page 1413, Page 1414, Page 1415,
Page 1416, Page 1417, Page 1418, Page 1419, Page 1420, Page 1421, Page 1422, Page
1423, Page 1424, Page 1425, Page 1426, Page 1427, Page 1428, Page 1429, Page 1430,
Page 1431, Page 1432, Page 1433, Page 1434, Page 1435, Page 1436, Page 1437, Page 1438
Contributed by: Harvard University, MCZ, Ernst Mayr Library
Sponsored by: Harvard University, Museum of Comparative Zoology, Ernst Mayr Library
Generated 9 February 2009 12:50 PM
http://www.biodiversitylibrary.org/pdf1/000194800023745
THE UNIVERSITY OF KANSAS
SCIENCE BULLETIN
Vol. XXXVIII, Pt. II] March 20, 1958
[No. 22
A Statistical Method for Evaluating Systematic
Relationships 1
BY
Robert R. Sokal and Charles D. Michener 2
Department of Entomology
University of Kansas, Lawrence
Abstract. Starting with correlation coefficients (based on numerous char-
acters) among species of a systematic unit, the authors developed a method
for grouping species, and regrouping the resultant assemblages, to form a classi-
ficatory hierarchy most easily expressed as a treelike diagram of relationships.
The details of the method are described, using as an example a group of bees.
The resulting classification was similar to that previously established by classi-
cal systematic methods, although some taxonomic changes were made in view
of the new light thrown on relationships. The method is time consuming, al-
though practical in isolated cases, with punched-card machines such as were
used; it becomes generally practical with increasingly widely available digital
computers.
INTRODUCTION
The purpose of the study reported here was to develop a quanti-
tative index of relationship between any two species of a higher
systematic unit, as well as to exploit such indices of association in
the establishment of a satisfactory hierarchy. The authors became
interested in the development of such a method when they at-
tempted to find a technique for classifying organisms that was free
from the subjectivity inherent in customary taxonomic procedure.
1. Contribution number 945 from the Department of Entomology, University of Kansas.
2. We wish to acknowledge the constructive criticism received in connection with this
and related work from the following individuals who kindly gave their time to read and
comment upon the manuscript: Paul R. Ehrlich, University of Kansas; Raymond B. Cat-
tell, University of Illinois; Alfred E. Emerson, University of Chicago; Warwick E. Kerr,
Universidade de Sao Paulo; Ernst Mayr, Harvard University; Louis L. McQuitty, Michigan
State University; G. G. Simpson, American Museum of Natural History; Peter C. Silvester-
Bradley, University of Kansas and University of Sheffield; and Paulo E. Vanzolini, De-
partmento de Zoologia, Secretaria de Agricultura, Sao Paulo. These persons, however, are
not responsible for the opinions which we have expressed.
Acknowledgment is also due to the University of Kansas General Research Fund for
assistance.
1410 The University Science Bulletin
The systematic group chosen as a test of the feasibility of this under-
taking was one consisting of 97 species of solitary bees in the family
Megachilidae. This choice was made because one of us (C. D. M.)
has made recent systematic studies of these insects, so that conclu-
sions as to the relationships obtained by the usual systematic pro-
cedure could be compared with the results of the new method.
The findings of our study as well as the philosophical bases of
our attempts at quantifying systematic relationships have been re-
ported elsewhere (Michener and Sokal, 1957). In this paper we
propose to describe in some detail the actual method employed,
as well as our reasons for adopting it and for rejecting several alter-
nate procedures. It is our intention to illustrate the procedures in
sufficient detail so that persons with a limited knowledge of statisti-
cal methods will be able to follow our method. We expect our
system to be applicable to most organisms, provided they exhibit a
variety of characters, and the account to follow is consequently
phrased in general terms. However, our practical illustrations are
based on the bee group cited above in order to provide the reader
with concrete examples.
A quantitative method of finding the relationship between two
species must be based on a number of taxonomic characters in a
manner similar to the traditional systematic approach. However,
whereas the latter technique generally uses few characters and
weights these quite unequally and subjectively, the former method
employs numerous but unweighted characters. Our reasons for not
weighting characters have been detailed in the companion paper
(Michener and Sokal, 1957). In the absence of an objective criterion
of character weight it seems best to rely on a large number of
equally weighted characters. In our bee study we employed 122
characters per species; however, we feel significant results may be
obtained from as few as 60 characters.
Our use of the word "character" will require some elaboration.
In its commonest taxonomic usage, a character is any feature of one
kind of organism that differentiates it from another kind. Thus the
red abdomen of one bee is a character distinguishing it from another
bee with the abdomen black. In this paper we use the word in a
second connotation only; that is, as a feature which varies from one
kind of organism to another. Now, to use the above example, abdominal
color is the character, which occurs in two "states" or alternatives,
red and black.
Evaluating Systematic Relationships 1411
For each character the states were coded: 1, 2, 3, etc. In the
bee study the number of states per character ranged from two to
eight. Much variation in the number of states is undesirable from
the point of view of the methods discussed below. In the study we
undertook most characters had either three or four states. How-
ever, when variation exceeds desirable bounds it might be prefer-
able to divide the character state codes by a common denominator
or to normalize them.
The kinds of characters used in the bee study and the manner in
which they were coded are discussed at length by Michener and
Sokal (1957). The possible effect of parallelism is also treated in the
same article. For purposes of the present paper the available data
might be summarized as follows: we have records of a given num-
ber (n) of species. For each species we have k records, k being
the number of characters considered in the study. The coded values
for any character may range from 1 to 9 depending on the number
of states in which this character occurs in the group under con-
sideration. As was mentioned previously it is desirable to have the
number of states not differ too widely for the various characters.
While it is not necessary to limit the number of possible character
states to nine, our particular computational setup was greatly fa-
cilitated by the use of a single digit code.
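The data layout summarized above can be pictured as a table of n species by k characters holding single-digit codes. The following is only an illustrative sketch: the three species, the four characters, all code values, and the names `codes` and `validate_codes` are invented here and do not come from the bee study.

```python
# Hypothetical coded data: one row per species, one column per character.
# Every value below is invented for illustration; the bee study had
# n = 97 species and k = 122 characters.
codes = {
    "species_A": [1, 2, 1, 3],
    "species_B": [1, 2, 2, 3],
    "species_C": [2, 1, 4, 1],
}


def validate_codes(table, max_state=9):
    """Check that all species have k records and that each code is 1..max_state."""
    lengths = {len(row) for row in table.values()}
    if len(lengths) != 1:
        raise ValueError("every species needs the same number of characters")
    for name, row in table.items():
        for code in row:
            if not (isinstance(code, int) and 1 <= code <= max_state):
                raise ValueError(f"{name}: code {code!r} is outside 1..{max_state}")
    # Return (n, k): number of species and number of characters.
    return len(table), lengths.pop()


n, k = validate_codes(codes)
print(n, k)
```

The single-digit restriction (`max_state=9`) mirrors the one-column-per-code punched-card layout mentioned in the text; it is a convenience of that setup, not a requirement of the method.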
PROCEDURES
Character correlations and species correlations
Two obvious ways suggested themselves to the authors regarding
a procedure for deducing relationships from the character states
of a group of species. We could either correlate characters with
each other or species with each other. Since both of these methods
would lead to interpretable, although differing, results a brief dis-
cussion of the implications of the two approaches follows.
Sturtevant (1942) undertook a study of the genus Drosophila
with objectives and procedures somewhat similar to ours. He re-
corded 33 morphological, cytological and life history characters for
each of 56 species of Drosophila and two species of the genus
Scaptomyza. In his aim to develop a classification "as free from
personal bias as I could make it," Sturtevant set up two tables. The
first was a table of the total number of differences with respect to the
33 chosen characters between any two of the 58 species. These give
the degree of difference between the species concerned and are
analogous to the complemental values of the "matching coefficients"
discussed in the section on Choice of a Correlation Coefficient below.
A second table showed correlations between characters, expressed
as two-way frequency distributions. By examining the three high-
est character correlations Sturtevant found that six species con-
sistently fell into the exceptional classes of the two-way frequency
distributions. They were the two Scaptomyza species and four
species of Drosophila which he thereupon placed in separate sub-
genera. On the basis of the number of character differences be-
tween and within subgenera Sturtevant was able to confirm this
classification and arrive at some ideas on the relationships and ori-
gins of the various groups. He also performed a similar analysis on
29 characters of 40 genera of flies (Scatophaga, Conops and 38 as-
sorted Acalypterae) to establish the relations of the family Droso-
philidae. Unfortunately the paper cited lists only summaries of
the above tables and it is therefore difficult to compare Sturtevant's
findings with ours.
Correlation between characters (R-technique in the idiom of the
factor analysts) is the customary technique in biological and psychological
studies involving correlational analysis. In character
correlation matrices involving studies within one species each cor-
relation represents the sum total of the common forces acting on
any pair of characters. When analyzed by some method of factor
analysis, the matrix customarily yields a so-called general size factor,
a series of group factors affecting various groups of characters, and
residual specific factors affecting single characters only. The fore-
going is an example of a factor constellation involving morphologi-
cal characters and is not necessarily the only possible constellation.
As a matter of fact much psychometric work and the biometric
papers by Howells (1951) and Stroud (1953) use the method of
"simple structure" which a priori rejects solutions involving general
factors.
Regardless of the constellation preferred, the factors common to
two characters and causing them to be correlated could be visualized
as developmental forces, genetic or environmental in the final anal-
ysis. The range of these genetic or environmental forces is de-
pendent on the causes of variation within the sample of individuals
studied. Thus a sample of individuals from an inbred, isogenic, line
of animals would yield character correlations reflecting common
nongenetic, physiological (i.e., caused by microecological dif-
ferences) factors only. Another sample comprising individuals from
various races or subspecies would provide correlations based on
common factors representing (1) genetic differences between in-
dividuals; (2) genetic differences between races; (3) nongenetic
physiological differences between individuals and (4) nongenetic
ecological differences between races. One of the authors (R. R. S.)
has been able to accumulate a series of character correlation mat-
rices from various organisms representing these levels of variation.
Matrices on correlation of aphid characters within galls (clones)
and between galls have been published (Sokal, 1952) while similar
matrices on aphid correlations between localities and morphological
correlations within and between strains of houseflies and Drosophila
await suitable analysis and publication.
When the sample transcends the bounds of the species the fac-
tors behind a character correlation matrix take on new meaning:
They now represent genetic divergence or the results of evolution-
ary processes. In the one case they were ontogenetic forces, in the
other they are phylogenetic forces. This type of analysis was
pioneered by Stroud (1953) who analyzed correlations of 14 char-
acters for soldiers of 48 species and imagines of 43 species of the
termite genus Kalotermes. He was able to interpret some factors
extracted from his correlation matrices as recognizable evolutionary
trends.
Another method of correlational analysis is called the transposed
matrix method or the Q-technique (as compared with the R-tech-
nique of character correlations, discussed above). 3 It consists of
correlations between individuals based on measurements of char-
acters which they have in common. In psychology this involves
correlations between persons based on scores for common tests
which these persons have taken. In the Q-technique we are in effect
dealing with the same kind of raw data as in the R-technique, but
we compute the correlation coefficients by summing squares and
products at right angles to the direction previously taken (or we
transpose the matrix before computation which amounts to the
same thing).
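The difference between the two techniques amounts to which axis of the data table is correlated. The sketch below is a minimal illustration, assuming a small invented table with species as rows and characters as columns; the function names `pearson`, `correlation_matrix` and `transpose` are chosen here for clarity and are not taken from the paper.

```python
from math import sqrt


def pearson(x, y):
    """Product-moment correlation between two equally long sequences."""
    n = len(x)
    mean_x, mean_y = sum(x) / n, sum(y) / n
    sxy = sum((a - mean_x) * (b - mean_y) for a, b in zip(x, y))
    sxx = sum((a - mean_x) ** 2 for a in x)
    syy = sum((b - mean_y) ** 2 for b in y)
    return sxy / sqrt(sxx * syy)


def correlation_matrix(rows):
    """Correlate every row with every other row."""
    return [[pearson(a, b) for b in rows] for a in rows]


def transpose(matrix):
    return [list(column) for column in zip(*matrix)]


# Invented codes: 4 species (rows) by 5 characters (columns).
data = [
    [1, 2, 1, 3, 2],
    [1, 2, 2, 3, 1],
    [2, 1, 4, 1, 3],
    [2, 1, 3, 1, 3],
]

q_matrix = correlation_matrix(data)             # Q-technique: species x species
r_matrix = correlation_matrix(transpose(data))  # R-technique: character x character

print(len(q_matrix), len(r_matrix))
```

Both matrices come from the same raw table; only the direction of summation changes, which is why the transposed-matrix description and the Q-technique description are equivalent.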
A Q-technique correlation coefficient in a study correlating in-
dividuals of one species represents common forces or factors acting
on the two individuals concerned. In this case we cannot speak of
the "sum total of common forces" as we could in the case of the
3. In a recent paper Cattell (1954) has suggested restricting the Q and R symbolism
to studies involving factor analysis and proposed Q' and R' for studies, such as the present
one, employing more superficial methods.
R-technique. Insofar as the characters used are indicative of the
entire spectrum of potential variation of the individuals we can
say that the resulting correlation coefficient is representative of the
real affinity between two individuals. When scanned for clusters
of high correlation coefficients the Q-type matrix reveals types of
individuals which are similar. It is thus especially suited to classi-
ficatory problems. When subjected to factor analysis the resulting
factors are now of a different nature. The general size factor has
been lost and in its place we find a general taxonomic group factor
which accounts for the overall correlations of all the individuals in
the study.
When, as in the present study, the correlation is between species
of a taxonomic unit the general factor is a general systematic factor
denoting overall relationship within the systematic group. The
species having the highest factor loading would be most representa-
tive of the group. Other factors would describe subgroups within
the systematic unit and describe the relationships of these subgroups
with each other and of the species to the subgroups. It should be
clear from the above that for purposes of biological classification
the relationships represented by a Q-technique matrix are more
meaningful by far than are those of an R-technique matrix. Except
for the above-mentioned work of Sturtevant (1942) which involved
not correlations but character differences, the only Q-type study
in systematics of which the authors are aware is in a publication
by one of them (Sokal, 1958) containing factor analyses of selected
portions of the present data. A number of the phytosociological
coefficients of association and similarity can be considered as of the
Q-type.
Psychologists have used Q-technique repeatedly (e. g., Burt 1937,
Stephenson 1936), although R-technique is still preferred in most
studies. Cattell (1952) has listed 5 points of criticism of the Q-technique.
It is appropriate that we discuss briefly their relation to
the problems under study here. The first objection is that Q-tech-
nique loses the general size factor, yielding in its place a common
species factor. This latter is claimed to be trivial by Cattell, and
correctly so, for psychological work. However, in a matrix of cor-
relations between species such a general systematic factor delineates
the relation of individual species to the taxonomic group and in-
dicates the proportion of the variance of each species explained by
the general systematic factor.
Cattell's second objection to Q-technique is that it is unreasonable
to assume simple structure in the factorization of a Q-matrix. The
authors agree with this argument, but for the purposes of the pres-
ent paper it is not important since they are not here undertaking a
factor analysis. Furthermore, they feel that simple structure (i. e., ...)
is not necessarily a very suitable constellation for many biological
factorizations.
The third objection refers to a customary shortcoming of Q-matrices.
They are based on few individuals and generalizations
about the entire population are drawn from them. In this study,
the matrix is of course of more than adequate size. Furthermore
our conclusions are not intended to extend to species not included
in our study.
It is true that the species recorded are an eclectic sample from
those extant in the world today. On the other hand we are of course
dealing with a sample obtained by natural selection from the multi-
tude of species or specieslike entities that have existed since the
origin of the four genera of this study. Hypotheses regarding these
extinct species will be valid only insofar as recent species reflect the
course of evolutionary history.
Another point in connection with the third objection is the num-
ber of characters employed. True relationships will become
apparent only insofar as the characters adequately represent the
sources of variation within the species.
A fourth objection relates to the lack of equivalence in recording
and interpreting the factors from the Q- and R-matrices. It com-
pares the relative permanence of psychological tests with the rela-
tive impermanence of persons. In this study we are confronted
with characters and species varying in their relative permanence,
but both equally permanent when based on the time scale of the
scientist investigating them.
The fifth criticism, labelling the Q-technique as descriptive rather
than predictive, again is invalid when applied to the present data.
Since the purpose of the study is historically descriptive and one of
our aims is to divide the population of species into categories, the
technique's fault for psychological research becomes a virtue in our
field of investigation.
There are two evolutionary situations under which it is important
to examine the two types of matrices. The first might be referred
to as breakage of correlation. It occurs when in a certain evolution-
ary line two characters that were correlated in ancestral lines and
are still correlated in related lines become independent of each
other.
Under such conditions the R-matrix is a poor representation
of the relation between the two characters. There is no good
way of expressing such a correlation, close in one line, absent in
the other. On the other hand a Q-matrix is not affected by such
data.
Convergence of species for a number of characters is a second
disturbing phenomenon. Here the R-matrix is not affected while
the Q-matrix is affected if the convergent characters outweigh the
nonconvergent ones in numbers.
We do not believe this is likely if an adequate number of char-
acters is studied. In case of a preponderance of convergent char-
acters and in the absence of paleontological data it is doubtful
whether the systematists would be able to distinguish convergence
from relationship by descent.
From a consideration of the above arguments it follows that given
the objectives and material of the present study the Q-technique is
to be preferred to the R-technique and the objections made by
Cattell to the former method do not apply to our case. However,
besides the theoretical reason for adopting the Q-technique as re-
flecting relationships between species there were several practical
reasons for so doing. The problem of finding a suitable type of
correlation coefficient between characters would have been formid-
able in view of the coding system adopted. Since some of the char-
acters were present in two states only while others were present in
as many as eight states, there would probably not have been any
one type of correlation coefficient for all possible character com-
binations. A matrix based on correlation coefficients of different
types would be far from desirable. Furthermore, uniformity of
computational procedure was essential to efficient handling of the
data by International Business Machines (IBM) equipment.
Not to be underestimated is the saving in computation resulting
from adoption of a 97 x 97 species correlation matrix vs. a 122 x 122
character correlation matrix. The former requires the computation
of only 4656 correlation coefficients while the latter would neces-
sitate 7381 such coefficients.
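The two counts follow from the number of distinct pairs in a symmetric matrix with the diagonal excluded, m(m - 1)/2. A short check, with the function name `pair_count` chosen here for illustration:

```python
def pair_count(m):
    """Number of distinct off-diagonal pairs among m variables."""
    return m * (m - 1) // 2


species_pairs = pair_count(97)     # species correlation matrix (Q-technique)
character_pairs = pair_count(122)  # character correlation matrix (R-technique)
print(species_pairs, character_pairs)  # 4656 7381
```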
The choice of a correlation coefficient
As a next step a suitable correlation coefficient had to be chosen
to represent the correlations between species. There were serious
considerations against the use of the product-moment correlation
coefficient since the variables (species) are anything but normally
distributed. Table 1 presents frequency distributions of state codes
for four representative species. The distributions are highly asymmetrical.
Those for species 19 and 56 approach Poisson distributions
for their means when the class codes are reduced by one.
Any interpretation of this agreement is dubious, however, in view
of the variable number of states possible per character.
Table 1
Frequency distributions of state codes for the characters of species 19, 56,
83 and 84.
State code    Sp. 19    Sp. 56    Sp. 83    Sp. 84
                 f         f         f         f
1               54        56        48        46
2               31        40        42        41
3               31        14        23        26
4                3        11         7         6
5                2         1         2         2
6-7              1         0         0         1
Σf             122       122       122       122
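The comparison with a Poisson distribution can be checked roughly as follows. This is a minimal sketch, not the authors' computation: it takes the frequencies listed for species 56 in Table 1, reduces the state codes by one, and computes the Poisson frequencies expected for the observed mean. The variable names are assumptions made for illustration.

```python
from math import exp, factorial

# Frequencies for species 56 from Table 1, state codes 1..5 reduced to 0..4.
observed = [56, 40, 14, 11, 1]

total = sum(observed)
mean = sum(value * freq for value, freq in enumerate(observed)) / total

# Expected Poisson frequencies for the same total and mean.
expected = [
    total * exp(-mean) * mean ** value / factorial(value)
    for value in range(len(observed))
]

print(total, round(mean, 3))
for value, (obs, exp_freq) in enumerate(zip(observed, expected)):
    print(value, obs, round(exp_freq, 1))
```

Because the number of possible states differs from character to character, any agreement found this way is descriptive only, as the text cautions.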
Other correlation coefficients were considered and rejected. The
correlation ratio, $\eta$, is unsuitable since $\eta_{yx}$ does not necessarily equal
$\eta_{xy}$. Tetrachoric r would have lost some of the information available
because it would necessitate reducing all characters to two
states. Furthermore the theoretical assumption of
normality essential to correct application of the tetrachoric correlation
coefficient cannot be defended for all characters.
Another method of demonstrating an association between species
would be the very simple one of counting the numbers of matches
in states for the 122 characters of any pair of species of bees and
then dividing this number by 122, the highest possible number of
such matches. The results for species 19, 56, 83, and 84 are shown
on table 2 where these "matching coefficients" are compared with
product-moment correlation coefficients. The "matching coeffici-
ents" are somewhat higher than the correlation coefficients but
resemble them in relative magnitude. In spite of this fact, "matching
coefficients" were not used since they have an unknown sampling
distribution, they distort resemblances by counting a 3 to 4 mismatch
the equal of a 1 to 7 mismatch, and finally they would
have been harder to handle by the IBM equipment available to us.
Table 2
"Matching coefficients" (below diagonal) and product-moment correlation
coefficients (above diagonal) between species 19, 56, 83 and 84.
        19     56     83     84
19       X    .40    .37    .37
56     .52     X     .47    .38
83     .53    .61     X     .93
84     .50    .54    .87     X
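The "matching coefficient" described above is simply the proportion of characters on which two species share the same state. A minimal sketch, using two invented species coded for eight hypothetical characters (the bee study used 122):

```python
def matching_coefficient(a, b):
    """Fraction of characters for which two species have the same state code."""
    if len(a) != len(b):
        raise ValueError("both species must be coded for the same characters")
    matches = sum(1 for x, y in zip(a, b) if x == y)
    return matches / len(a)


# Invented state codes for illustration only.
species_p = [1, 2, 3, 1, 2, 2, 1, 1]
species_q = [1, 2, 4, 1, 2, 3, 1, 7]

print(matching_coefficient(species_p, species_q))  # 0.625
```

The example also shows the distortion mentioned in the text: the 3 versus 4 mismatch in the third character counts exactly as much as the 1 versus 7 mismatch in the last one.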
Lacking a more suitable means of correlation we adopted the
product-moment r, in spite of nonnormal distribution of variates
and possible heteroscedasticity. Various ways of improving the
distributions by means of transformations were tried. Table 3 shows
the same correlation coefficients as the upper half of the matrix of
table 2, but based on $\sqrt{X}$ and $\sqrt{X + .5}$ transformations. The
slight differences obtained do not justify the extra computational
labor involved.
We have already briefly touched on the desirability of coding the
data in such a way as to put all character states on the same scale.
In a character with two states the code 2 indicates a situation different
from that indicated by the same code in a character with more states.
In Q-technique studies
the scores for different tests are often not in comparable units. This
situation is usually met by normalizing the rows (tests, or in our
case characters) of the raw score matrix. The authors did not perform
this transformation since (1) it would have removed the common
systematic factor from the matrix of correlations and would
thus have lowered the correlation coefficients considerably; (2) application
of the character state codes does standardize the data to
a certain extent because 76 percent of the characters have either
three or four states and only 3 percent have six or more states; (3)
although the additional labor of normalizing the variates would not
have been excessive the amount of IBM work involved in computing
correlation coefficients would have been prohibitive, since a
one-digit code would not have sufficed for normalized data.
Table 3
Product-moment correlation coefficients between species 19, 56, 83 and 84
based on variates coded as $\sqrt{X}$ (below diagonal) and as $\sqrt{X + .5}$ (above
diagonal). Compare with uncoded product-moment correlation coefficients in
table 2.
        19     56     83     84
19       X    .42    .36    .37
56     .42     X     .50    .41
83     .36    .51     X     .93
84     .37    .41    .93     X
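The row normalization that the authors decided against can be sketched as follows. This assumes that "normalizing" means converting each character to zero mean and unit standard deviation across species, which is one common reading and is not spelled out in the text; the data and the function name `standardize_characters` are invented for illustration.

```python
from statistics import mean, pstdev


def standardize_characters(data):
    """Rescale each character (column) to mean 0 and standard deviation 1.

    data: rows are species, columns are characters. A character with the
    same state in every species has zero spread and cannot be standardized.
    """
    columns = list(zip(*data))
    stats = [(mean(col), pstdev(col)) for col in columns]
    if any(spread == 0 for _, spread in stats):
        raise ValueError("a constant character cannot be standardized")
    return [
        [(value - centre) / spread for value, (centre, spread) in zip(row, stats)]
        for row in data
    ]


# Invented codes: 4 species (rows) by 5 characters (columns).
data = [
    [1, 2, 1, 3, 2],
    [1, 2, 2, 3, 1],
    [2, 1, 4, 1, 3],
    [2, 1, 3, 1, 3],
]

for row in standardize_characters(data):
    print([round(value, 2) for value in row])
```

The standardized values are no longer whole numbers, which illustrates the third reason given above: a one-digit code could not have held them.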
The authors are well aware that their methodology of coding and
correlation could profit by refinement. It is, however, our point of
view that in a pilot study of this nature such refinements are premature.
Should the general method prove of value, significant results
will surely emerge in spite of minor imperfections in technique.
Computation
The computation of a large matrix of correlation coefficients such
as the 97 x 97 bee matrix presents serious technical difficulties. Only
high speed electronic computing machines are able to perform this
operation with real dispatch. At the time our bee data were being
processed we had only punched-card tabulating machines at our
disposal. It might be noted here that a computational operation of
this magnitude cannot reasonably be undertaken without some auto-
matic computing facilities. The equipment used by the authors is
that available in the University of Kansas IBM laboratory: a card
punch (type 26), a verifier (type 56), an accounting machine (type
402) and a reproducing machine (type 514).
The computational problem was simplified somewhat by the fact
that the variates consisted of single digits only. This increased the
number of variables that the machine could process simultaneously.
Each IBM card represented a character with the state code of each
species for the particular character listed in separate columns. Since
there are only 80 columns per card, it was impossible to record all
species on any one card. A different approach was therefore
adopted and the card divided as follows:
Column 1 — Project code
Columns 2-4 — Character code number
Column 5 — Deck code (explained below)
Columns 6-8 — Left blank for possible subsequent use
Columns 9-44 — Multiplier columns for 36 species
Columns 45-80 — Multiplicand columns for 36 species.
The 97 species were divided into group I for species 1-36, group II
for species 37-72 and group III for species 73-97. Since group III
used only 25 columns another 5 columns were taken up by a repeti-
tion of data on species 1 through 5, which we used as a check on
computational procedure. Six decks of 122 cards each, one card
per character, were then prepared. They were constituted as
follows:
Deck   Multiplier   Multiplicand
1      Group I      Group I
2      Group II     Group II
3      Group III    Group III
4      Group I      Group II
5      Group II     Group III
6      Group I      Group III
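The deck arrangement above ensures that every pair of species meets in exactly one deck. The sketch below is a hypothetical lookup written for illustration (the names `GROUPS`, `DECKS`, `group_of` and `deck_for_pair` are not from the paper); it uses the group boundaries and deck table given in the text.

```python
# Species groups as described in the text.
GROUPS = {
    "I": range(1, 37),     # species 1-36
    "II": range(37, 73),   # species 37-72
    "III": range(73, 98),  # species 73-97
}

# Deck number -> (multiplier group, multiplicand group).
DECKS = {
    1: ("I", "I"),
    2: ("II", "II"),
    3: ("III", "III"),
    4: ("I", "II"),
    5: ("II", "III"),
    6: ("I", "III"),
}


def group_of(species):
    for name, members in GROUPS.items():
        if species in members:
            return name
    raise ValueError(f"species {species} is outside 1..97")


def deck_for_pair(a, b):
    """Return the deck in which the products for species a and b are formed."""
    wanted = {group_of(a), group_of(b)}
    for deck, (multiplier, multiplicand) in DECKS.items():
        if {multiplier, multiplicand} == wanted:
            return deck
    raise ValueError("no deck covers this pair")


print(deck_for_pair(43, 44), deck_for_pair(26, 92))  # 2 6
```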
Different card colors besides a punched code were used to dis-
tinguish the decks.
By running these decks in succession through the tabulator we
were able to reduce rewiring of the board to one half of what it
would have been with the minimum number of decks (3).
The method of arriving at the $\sum x^2$ and $\sum xy$ was the customary
one of progressive digiting with interspersed "X-cards." Running
time on the 402 tabulator was some 24 hours. Punching and verifying
of the cards had taken a similar amount of time. Thus the
preparation of the $\sum x^2$ and $\sum xy$ for the entire matrix took about a
week. These values were computed for a half-matrix only. However,
a test deck and five test variables detected wiring errors and
machine malfunction with a reasonable limit of safety.
The next step was the computation of the correlation coefficients.
This was done by computers using desk calculators. 4 The matrix of
squares and products was subdivided into manageable sections, 30
variables (species) square. All computations were checked by a
different computer and, where possible, by different steps. The
computational procedure employed was the customary L method. 5
It does not seem necessary to elaborate on the details of this method.
Any good textbook of statistics will contain a section on the com-
putation of a product-moment correlation coefficient. Furthermore,
each computation center has its own setup for correlation coefficients
depending on the capabilities of the machines and thus no general
account need be presented here.
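For readers who want to see the arithmetic, the L method given in footnote 5 works directly from sums, sums of squares and sums of products, which is what the tabulator supplied. A minimal sketch follows; the function name `product_moment_r` and the two short example vectors are invented for illustration.

```python
from math import sqrt


def product_moment_r(x, y):
    """Product-moment r from raw sums, following the L method of footnote 5."""
    n = len(x)
    sum_x, sum_y = sum(x), sum(y)
    sum_xy = sum(a * b for a, b in zip(x, y))
    sum_x2 = sum(a * a for a in x)
    sum_y2 = sum(b * b for b in y)

    l_xy = n * sum_xy - sum_x * sum_y
    l_x = n * sum_x2 - sum_x ** 2
    l_y = n * sum_y2 - sum_y ** 2
    return l_xy / sqrt(l_x * l_y)


# Invented state codes for two species over five characters.
species_p = [2, 1, 4, 1, 3]
species_q = [2, 1, 3, 1, 3]

print(round(product_moment_r(species_p, species_q), 4))  # 0.9587
```

With single-digit integer codes every intermediate quantity except the final quotient is a whole number, which suits both punched-card equipment and desk calculators.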
The correlation coefficients were calculated to four significant
decimal places and entered on a matrix. Three decimal places
would have been quite sufficient for this study; however four were
computed in case later statistical work required greater refinement.
Total computation time for this phase of the work was 160 man-
hours. It should be emphasized that the time estimates given above
refer to the relatively simple equipment available to us. Digital
computers are now available which would handle the entire com-
putation, from raw data to completed correlation matrix without
human intervention in less than an hour. This would be only one-
two hundredths of the time it took us to compute the same informa-
4. The writers at this point wish to express their appreciation to Misses Betty Becker,
Marion Clyma, Jacqueline Johnson, Normandie Morrison, and Messrs. D. A. Crossley, Jr.,
Ralph Jones and Roger Price for their conscientious assistance with IBM work and desk
computation.
5. $r_{xy} = L_{xy} / \sqrt{L_x L_y}$, where $L_{xy} = N\sum XY - \sum X \sum Y$ and
$L_x = N\sum X^2 - (\sum X)^2$, $L_y = N\sum Y^2 - (\sum Y)^2$.
tion! With every passing year electronic computers are becoming
more efficient and more widely distributed. Thus the computational
aspects of our method will become a progressively less important
impediment.
Since the matrix of correlation coefficients was unwieldy (it also
had to be subdivided into sections ) and since further work with the
correlation coefficients was contemplated, the latter were punched
on 4656 IBM cards, one to a card. These cards were duplicated by
means of the reproducing punch in order to obtain cards for a com-
plete matrix of 9312 correlation coefficients. Information on these
cards included matrix row and column numbers for the particular
correlation, the coefficient with sign, and a class code for the co-
efficient. These class code numbers (1-22) represented 22 classes
of a frequency distribution of the correlation coefficients arrayed
in ascending order of magnitude with class intervals of .05. In ad-
dition, the cards contained codes for the relationship between the
two species involved as evaluated by conventional systematic methods
(by C. D. M.).
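The class coding of the coefficients can be sketched as below. The text gives only the number of classes (22) and the class interval (.05); the lower boundary of -.10 used here is an assumption, chosen because 22 intervals of .05 then run from -.10 to 1.00 and cover the reported range of coefficients. The function names are likewise invented.

```python
# Assumed layout: class 1 covers -.10 up to (not including) -.05, and so on,
# up to class 22, which covers .95 to 1.00. The -.10 origin is an assumption.
LOWER_BOUND = -1000   # -.10 expressed in ten-thousandths
CLASS_WIDTH = 500     # .05 expressed in ten-thousandths
CLASS_COUNT = 22


def class_code(r):
    """Map a correlation coefficient (four decimals) to a class code 1..22."""
    ten_thousandths = round(r * 10000)  # integer arithmetic avoids float edge cases
    code = (ten_thousandths - LOWER_BOUND) // CLASS_WIDTH + 1
    return min(max(code, 1), CLASS_COUNT)


def class_mark(code):
    """Midpoint of a class under the assumed layout."""
    return -0.10 + (code - 0.5) * 0.05


print(class_code(-0.0626), class_code(0.9747))
print(class_code(0.38), round(class_mark(class_code(0.38)), 3))
```

Under this assumed layout the class marks fall at .375, .775 and so on, which is at least consistent with marks quoted to two decimals as .38 and .78.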
The correlation coefficients on punched cards have so far been
put to the following uses: We have compiled a printed tape record
of the full matrix, column by column, which has been very useful
for reference and further computation. Another tape has been com-
piled giving a listing and frequency distribution of the correlation
coefficients grouped in the 22 size classes. This tape has been of
great value in various approaches to a classification of the relation-
ships demonstrated by the matrix. A third tape lists the sums of
the correlation coefficients, column by column. This has been nec-
essary for the B-coefficient method briefly described below. A
fourth tape presents a two-way frequency distribution showing the
relation between correlation coefficients and the relationship code
developed by conventional systematic methods. These tapes were
prepared in a few hours' running time from the correlation coefficient
cards, which we still expect to use in a variety of ways.
The matrix of correlation coefficients
In the bee study the 4656 correlation coefficients computed in
the above manner ranged in magnitude from -.0626 for the correlation
between species 26 and 92, to .9747 for the correlation between
species 43 and 44. 6 As was mentioned previously, a frequency
distribution of these coefficients, grouped into 22 classes
with class intervals of .05, was set up. The modal class showed a
class mark of .38; this represents the most frequent class of
correlation coefficients found between species in this study. However,
a second mode was located at .78. This bimodality would
indicate that we are dealing with two populations of correlation
coefficients: those indicating close, possibly intrageneric relations
and others representing more distant relations. Codes representing
Michener's previous views on the relationships among the species
were correlated with the above coefficients. The single correlation
coefficient between the correlation matrix and Michener's codes
was .80. It was encouraging to find that the magnitude of the
correlation coefficients in our matrix was apparently an estimate of
systematic relationship as indicated by the previous classification.

6. For lack of space the matrix cannot be reproduced here. Microfilm or IBM-tape
or card copies can be obtained through the Secretary, Department of Entomology,
University of Kansas, Lawrence.
Another way of examining these correlation coefficients is to study
frequency distributions of the coefficients for any single species
against all other species. By this means we were able to distinguish
members of closely related groups of species from isolated species
within a genus and these in turn from very isolated species representing
monotypic genera or subgenera. For a detailed discussion and
illustrations of this procedure, the reader is referred to Michener
and Sokal (1957).
The absence of significant negative correlations from our matrix
requires some discussion. Q-technique matrices of correlations be-
tween people (based on psychological tests) are quite likely to
yield such correlations. If there are distinct, antithetical types of
persons represented in the matrix, such as extroverts and introverts,
it is likely that a high score for one type will be a low score for the
other and vice versa. In our case evolutionary progress may be
represented by either an increase or a decrease in state codes. In
the majority of characters the supposedly primitive situation is an
intermediate state code with two diverging evolutionary trends rep-
resented by the lower and higher code numbers. Furthermore,
characters representing correlated trends were not necessarily coded
along the same scale or in the same direction. It is clear that under
such circumstances distantly related forms are likely to be uncor-
related rather than negatively correlated.
The search for group structure
The matrix of correlation coefficients between species can be put
to a variety of uses and the analysis reported below represents
merely an initial effort at an exploitation of the data. The correlation
coefficients serve as an absolute measure of relationship between
any two species in our study, limited only insofar as the
characters chosen do not represent the total correlated variation of
the two species.
The search for structure among the correlation coefficients of the
matrix is of course no different in aim from the search by the sys-
tematist for a natural system in an array of species. Such a system
consists of a hierarchy of groups. Various methods can be used for
discovering a hierarchy in data such as ours. A customary, rather
simple device of the psychometrician is so-called "cluster analysis,"
developed to a fine art by Tryon (1939).
A concise description of the procedure (the ramifying linkage
method) is given in Cattell (1944) and Thomson (1951). Because
of the simplicity of the procedure, cluster methods are used exten-
sively, although Cattell (1944, 1952) and others have pointed out
that cluster analysis cannot be considered a substitute for the more
involved factor analytic methods. Attempts to employ cluster anal-
ysis for finding structure in our matrix were only partially success-
ful, since the resulting clusters were partly overlapping, i. e., a given
species might be a simultaneous member of two clusters. This
makes good sense for intermediate forms in an abstract scheme of
relationships. In a systematic hierarchic classification, however,
groups at the same level have to be mutually exclusive for practical
as well as for theoretical reasons, except for low level groups exhibit-
ing reticulate evolutionary pattern (rare above the species level in
animals). A further reason for the unsuitability of cluster analysis
is the complexity of the clusters as more species are added to them.
Although clusters are therefore not convenient in an initial search
for structure, the diagram of relationships established by methods
to be described below could be easily recognized in the clusters
outlined by cluster analysis. A method essentially similar to cluster
analysis is the p-group and pF-group method of Olson and Miller
(1951) applied to three paleontological R-technique matrices. It
suffers from the same drawbacks as cluster analysis. 7
7. After the present research and manuscript had been completed one of us (R. R. S.)
became acquainted with the psychometric work of Professor Louis L. McQuitty of the
Michigan State University, who in recent years has developed a whole battery of refined
cluster methods (McQuitty, 1955, and a series of papers in press in The British Journal
of Statistical Psychology, Educational and Psychological Measurement, and Psychological
Monographs). Several of these papers deal with psychological problems which are closely
related to those of biological classification. One of the methods invented by McQuitty
bears a close resemblance to our variable group method developed below. It is interesting
(as well as reassuring to us) that workers in different fields had unknown to each other
developed some of the same formulations. We hope to try some of McQuitty's other methods
on our material. They have the advantage of simplicity and can be programmed for elec-
tronic computation without much difficulty. Indeed the time may not be far off when
computation for a study such as our bee work will be a minor matter routinely handled by
a computing center in a very few hours and the remaining problem will be the collection
of data for the machine and the interpretation of the voluminous answers that are produced.
As a technique for grouping the species we experimented exten-
sively with the coefficient of belonging (B-coefficient) of Holzinger
and Harman (1941). It is the sum of the correlations among the
members of a group divided by the sum of the correlations of these
group members with the other variables (species) of the study.
Results of our B-coefficient analyses for the bees were reasonably
good, as judged by the previous classification and by our subsequent
investigations. There was one main drawback, however. Large
species groups showed a lack of structure and relatively low B-
coefficients which would make the species in these groups appear a
good deal less related to one another than members of groups of
two or three species. The cause of this phenomenon is not hard to
find. In large species groups the denominator of the B-coefficient
would include high correlations due to correlations of group mem-
bers with numerous other prospective members not yet included in
the group. This would tend to depress the B-coefficient values. By
the time all such members have been admitted to the group, it has
become so large that even the admission of a relatively unrelated
variable will affect the B-coefficient only slightly.
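
A minimal sketch of the quantity as worded above, i.e., a ratio of two sums of correlations. The small matrix is invented for illustration and is not taken from the bee data, and the function name is hypothetical; Holzinger and Harman's own formulation may scale the two sums differently.

```python
from itertools import combinations


def b_coefficient(corr, group):
    """Ratio of within-group to group-versus-rest correlation sums.

    corr  : dict mapping frozenset({i, j}) -> correlation of i and j
    group : iterable of variable labels forming the candidate group
    """
    group = set(group)
    everyone = set().union(*corr.keys())
    outside = everyone - group
    within = sum(corr[frozenset(p)] for p in combinations(sorted(group), 2))
    between = sum(corr[frozenset((g, o))] for g in group for o in outside)
    return within / between


# Hypothetical correlations among four variables (illustration only).
corr = {
    frozenset(("a", "b")): 0.9,
    frozenset(("a", "c")): 0.3,
    frozenset(("a", "d")): 0.2,
    frozenset(("b", "c")): 0.4,
    frozenset(("b", "d")): 0.1,
    frozenset(("c", "d")): 0.8,
}

print(round(b_coefficient(corr, ["a", "b"]), 3))
```

Because the denominator grows with every outside variable that correlates highly with the group, a group with many prospective members still outside it receives a low value, which is the drawback noted above.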
In view of the disadvantages of the B-coefficient we developed
our procedure which is presented below in a general manner to-
gether with some of the reasons for its adoption. This presentation
is followed by a detailed step-by-step account of the computational
procedure for readers who wish to become more familiar with it.
A nucleus of a group was established, using the two species hav-
ing the highest coefficient of correlation. Then species would be
added to this nucleus, one at a time, always adding first the species
having the highest average correlation with members of the group.
The limit of the groups could be found by decreases ($L_{n+1} - L_n$)
in the level of the average correlation $L_n$, where the subscript refers
to the number of members in the group. As with the B-coefficient, a
significant drop is empirically determined since sampling distributions
of average correlations, such as $L_n$, are unknown. By developing
first lower groups (species groups), then by the same method
grouping these into larger groups (sometimes subgenera), and
these into still larger or higher groups, etc., it has been possible to
develop a hierarchy of groups for which the diagram of relationships
(figure 1) can serve as a representative. Each number in this figure
represents a different species; for a list of the species concerned see
Michener and Sokal (1957).
[Figure 1: tree diagram not reproduced. Species numbers run along the top; the
ordinate is graduated from 1000 at the top down to 400; labeled stems include
ANTHOCOPA, PROTERIADES and HOPLITIS.]

Fig. 1. Diagram of relationships for the genus Ashmeadiella obtained by
the weighted variable group method. Ordinate: magnitude of correlation
coefficient multiplied by 1000. Exact correlations between any two joining stems
can be found by reading the value on the ordinate corresponding to the
horizontal line connecting the stems. This value becomes approximate and
maximal in cases of multifid furcations. Broken lines used where more than three
stems join are for convenience only; the horizontal connecting line has the same
significance as elsewhere. "Roofs" over species numbers at the summits of the
lines delimit subgenera containing more than one species, as based on C. D. M.'s
previous findings and not on this study. Generic names are in small capitals.
The horizontal broken lines are not relevant to the present account; they are
explained by Michener and Sokal (1957).

Since $L_n$ is not amenable to rigorous statistical treatment it was
decided to recompute correlation coefficients (using Spearman's sum
of variables method) after the group limits at each hierarchic level
had been reached. Thus we returned at the end of each grouping
procedure to a new matrix of correlation coefficients about which
confidence statements might be made. Two further considerations
in the final choice of a method for grouping remain to be mentioned:
We might have admitted only one new member for each group
at a given hierarchic level, thus obtaining a diagram of relationships
consisting of bifurcations only. We have called this method the
pair-group method as contrasted with the variable-group method,
where any number of new members can be admitted to the group at
any one hierarchic level, the limit of the group being determined by
a significant drop in $L_n$. The pair-group method has some theoretical
justification in that much evolutionary ramification is believed
based on speciational processes involving the splitting of one species
into two. However, there must also occur some speciation as a result
of the splitting of a species into more than two isolates and, on
the assumption of equal evolutionary rates for these new lines, the
pair-group method would fail to represent the true situation. Moreover,
many of the groups must be markedly different, not merely
because of divergence, but because of extinctions of intermediates.
A group might be broken into any number of different subgroups
by different extinctions. Furthermore an empirical study of this
method (see fig. 2 for an analysis of relationships in the subgenera
Chilosima and Ashmeadiella by the pair-group method and compare
with the left side of fig. 1 for the variable-group method) demonstrates
that in spite of the pair-group device we are forced into
multifid furcations by drops in $L_n$ too small to plot or by temporary
reversals of $L_n$ values, discussed below. Thus the variable-group
method was adopted as the more reasonable and flexible of the two.

[Figure 2: tree diagram not reproduced. Species 63 through 85 run along the top;
the ordinate is graduated from 1000 down to 600.]

Fig. 2. Diagram of relationships for the subgenera Chilosima (63-64) and
Ashmeadiella s. str. (65-85), obtained by the method of pair-groups, i. e.,
diagrams would ideally consist of bifurcations only. Stems have been weighted.
Explanatory comments as for figure 1.

[Figure 3: hypothetical tree diagrams not reproduced: (a) species A to D,
(b) species A to H, (c) species S to Z.]

Fig. 3. Hypothetical diagrams of relationships to illustrate effects of different
methods of weighting stems. For explanation see text.
A second consideration is how to weight the variables during the
recalculation of the correlation matrix after each grouping procedure.
A simple diagram (fig. 3a) will make this issue clear. A
and B represent the two species with the highest correlation
coefficient. The $L_n$ for C against A and B is significantly below $r_{ab}$, so
that A and B are represented as being closer to each other than they
are to C. When studying the relation of a fourth species D with
group ABC we face the following problem: Should we calculate the
correlation of ABC against D with A, B and C equally weighted or
should we weight A = B and AB = C? Rephrased biologically, the
problem is whether to relate species D with the homogeneous group
ABC, or with the stem AB-C, where C carries as much weight in
determining the relation with D as do A and B together. Although
in a simple case, such as the one described above, the two alternatives
may not produce very different results, in a situation such as
depicted in fig. 3b species H might be weighted as 1/8 of the group A-H,
or 1/2, depending on the system adopted. Similarly species B would
be weighted 1/8 in the former case but only 1/128 in the latter case.
When dealing with fairly large groups the second method would
therefore reduce the weight of the early admitted members and in-
crease the weight of those species admitted later.
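
The two weighting systems can be compared with a short sketch. It assumes that fig. 3b shows species joining one at a time (A and B first, then C, and so on to H); that shape is inferred here, not confirmed from the figure, and the function names are hypothetical.

```python
from fractions import Fraction


def equal_weights(species):
    """Method one: every species in the group carries the same weight."""
    n = len(species)
    return {s: Fraction(1, n) for s in species}


def stem_weights(species):
    """Method two: each new member weighs as much as the whole old group.

    `species` lists the members in order of admission; the first two form
    the nucleus and are weighted equally.
    """
    weights = {species[0]: Fraction(1, 2), species[1]: Fraction(1, 2)}
    for newcomer in species[2:]:
        # Halve all earlier members, then give the newcomer one half.
        weights = {s: w / 2 for s, w in weights.items()}
        weights[newcomer] = Fraction(1, 2)
    return weights


order = list("ABCDEFGH")
print(equal_weights(order)["H"], stem_weights(order)["H"])
print(equal_weights(order)["B"], stem_weights(order)["B"])
```

Exact fractions are used so that the halving at each admission can be read off without rounding.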
The same problem is found in a situation such as shown in fig.
3c. By the first method species T is weighted 1/8, by the second
method it is weighted only [fraction illegible in source]. Neither of the
two methods is entirely
satisfactory. By method one we are reducing the importance
of species H and Z in representing groups A-H and S-Z respectively.
If the relationship diagrams of figures 3b and 3c depict true phylo-
genetic relationships, then H and Z should represent half of their
respective lines regardless of subsequent diversification in the other
halves. On the other hand giving relatively greater weight to single
late arrivals also gives heavier weight to specialized features of such
species and thus would tend to distort the relational pattern, while
specializations in the diversified branch of the stem tend to cancel
each other, permitting a better average picture of the groups to
emerge. The optimal system of weighting would be one between
these two extremes, weighting each species according to its number
of generalized and specialized features. This is clearly impossible
without renewed introduction of a subjective element into our procedure.
We therefore adopted the second method, i. e., the weighting
of new members as equal to the sum total of all old group members,
thinking it to be the less objectionable of the two. We feel
that this method will represent stems more correctly and that bias
introduced by specializations of late joiners will be kept down by
the large number of characters considered in our study.
We are reassured in our decision by the results of a comparative
study on the subgenera Chilosima and Ashmeadiella. Figure 4
shows the results of a variable-group analysis of these subgenera by
weighting method one, while results by method two can be seen in
the left side of the diagram of fig. 1. General agreement as to re-
lationships and level of furcations is very good. The main difference
between the two diagrams is that in method one group 77-81 8 first
receives 79 before receiving group 67-72, 78 and 84, while in method
two it first receives group 67-72, then 78 and then 79 among others.
8. In the interest of brevity groups will be identified by their leftmost (in the diagram)
and rightmost members with a dash separating the two. Thus 77-81 means group 77, 82,
83, 81. It clearly does not include all species ranging in number from 77 through 81,
and includes some beyond that numerical range.
[Figure 4: tree diagram not reproduced. Species 63 through 85 run along the top;
the ordinate is graduated from 1000 down to 600.]

Fig. 4. Diagram of relationships for the subgenera Chilosima (63-64) and
Ashmeadiella s. str. (65-85), obtained by the variable group procedure under
weighting method one, i. e., equal weights for all stems. Explanatory
comments as for figure 1.
Careful examination of the original correlation coefficients makes
the reasons for these differences clear. Group 77-81 is closer to 79
than to 78 except that 81 is closer to 78 than to 79. Also 81 is closer
to 67-72 than are 77, 82 and 83. Therefore in method two, where
81 receives as much weight as 77, 82 and 83 together, 67-72 joins
the nuclear group first. This is also partly due to the fact that un-
equal weighting of the species in the 67-72 group favors those close
to the 77-81 group. Since 78 is closer to 67-72 than 79, the latter,
while originally quite close to 77-81 is now temporarily delayed and
78 joins the combined group 67-81 before 79 does. These relations
are at too low a phyletic level to be included in the original diagram
of relationships drawn by C. D. M. who feels that there is little that
can be obtained from classical systematic studies of these species to
suggest whether method one or method two is preferable. In view
of the small over-all differences between the two methods and
especially in view of the fact that the lines concerned all join by
either method with a difference in correlation coefficients of less
than .06, it may well be that we have made too great an issue of the
matter. [...] two will present us with a reasonably bias-free picture.
The Weighted Variable Group Method
It was thought advisable to give a detailed account of our method
in order to enable readers to repeat the operations should they so
desire. The subgenera Chilosima and Ashmeadiella, which have
been used as a testing group before, will serve as an illustrative
example. These subgenera include species 63 through 85 (see
figure 1).
Correlation coefficients among these 23 species are shown in table
4. All values are significant with probability values of less than one
percent. The highest correlation coefficient among these species is
.965 for 67 x 68. 9 This is also the highest correlation involving either
of these two species. The next to enter group 67-68 is species 70
which has the greatest average correlation ($L_n$ = .892) with 67 and
68 since 67 x 70 = .896 and 68 x 70 = .889. No other species in the
study has as high an average correlation with 67-68, as can be
learned from a few trials. We established empirically, as a result
of numerous trials, that a drop in $L_n$ of .030 gave a satisfactory limit
for groups; since the drop from .965 to .892 exceeds this limit, 70 is
not to be admitted to group 67-68 at this particular time. Another
high correlation involving species other than 67 and 68 is
77 x 83 = .951. This is also the highest correlation
for the two species concerned. Next to join this nucleus is species
82 with an $L_n$ value of .936. The drop is less than .030; therefore
82 is admitted. Next to join is species 79 with an $L_n$ value against
77, 82 and 83 of .905. There is now a significant drop from the
previous $L_n$ value and 79 is excluded for the time being. Drops in
$L_n$ are always measured from the previous $L_n$, not from the initial
$L_n$. Our second group is therefore 77-83. In a similar manner we
established groups 63-64, 65-66 and 71-76, each consisting of only
two species.
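
The admission rule just described can be sketched as below, using only coefficients quoted in the text. This is an illustration under stated assumptions, not the authors' punched-card procedure: the function names are hypothetical, and the sketch omits the further check, applied later in the text, that a candidate's highest affinity must lie with the group it is about to join.

```python
def average_correlation(corr, candidate, group):
    """L_n: mean correlation of a candidate with the current group members."""
    return sum(corr[frozenset((candidate, m))] for m in group) / len(group)


def grow_group(corr, nucleus, candidates, max_drop=0.030):
    """Admit candidates one at a time until L_n drops by more than max_drop.

    Returns the final group and the list of (species, L_n) admitted.
    """
    group = list(nucleus)
    previous = corr[frozenset(nucleus)]  # level of the two-species nucleus
    admitted = []
    remaining = set(candidates)
    while remaining:
        # The best candidate has the highest average correlation with the group.
        best = max(remaining, key=lambda s: average_correlation(corr, s, group))
        level = average_correlation(corr, best, group)
        if previous - level > max_drop:
            break  # significant drop: the group is closed for this matrix
        group.append(best)
        admitted.append((best, level))
        remaining.remove(best)
        previous = level  # drops are measured from the previous L_n
    return group, admitted


# Coefficients quoted in the text.
corr = {
    frozenset((67, 68)): 0.965, frozenset((67, 70)): 0.896,
    frozenset((68, 70)): 0.889, frozenset((77, 83)): 0.951,
    frozenset((77, 82)): 0.925, frozenset((82, 83)): 0.948,
}

print(grow_group(corr, (67, 68), [70]))   # 70 is refused
print(grow_group(corr, (77, 83), [82]))   # 82 is admitted
```

Species 79 is left out of the example because only its average correlation with 77, 82 and 83 is reported, not the three individual coefficients.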
So far only 11 species out of the 23 of the study have been placed
into groups. A systematic survey was then made of the remaining
12 species to see if any group had been missed. For example,
examination of species 69 revealed that its highest correlation was
with species 72 (69 x 72 = .820). However, this latter value was not
the highest correlation for 72, since 72 x 76 = .904. Thus 72 might
eventually join the group containing 76, and 69 might join the group
containing 72, both of which events came to pass at a later stage of
the analysis. At the present time, however, species 69 and 72 are
not assigned to any group. Similarly the remaining ten species
in the study were shown not to belong to any nuclear group. To
set up one of the latter we required a correlation coefficient which
was the highest one for both participating species (i.e., the
reciprocally highest correlation).

9. We shall use this symbolism in place of the more formal $r_{67 \cdot 68}$.

[Table 4: correlation coefficients among species 63 through 85 of the subgenera
Chilosima and Ashmeadiella. The table is not legible in this copy.]
After the groups had been delimited a new correlation matrix
was computed considering the newly formed groups as single
variables, i.e., the previous matrix of 23 variables (matrix 1) was
reduced to one of 17 variables (matrix 2). It is self-evident that
the only correlation coefficients in need of recomputation were those
involving new groups. Correlations involving only species that had
remained single were not altered in any way. As a matter of fact,
a procedure was devised by means of which the correlations were
not even recopied, but variables joined into groups were crossed
out and the new group variables were entered along the margins of
the old matrix. The actual computational procedure is quite simple
and considerably less complicated than the computations for finding
the original correlation coefficients. It is described in the paper by
its originator (Spearman, 1913) and also by Holzinger and Harman
(1941). Let us illustrate this method by computing the correlation
between groups $(63)_1$ and $(67)_1$. 10 The general formula for this
computation is

$$r_{q \cdot Q} = \frac{\square_{qQ}}{\sqrt{q + 2\triangle_q}\,\sqrt{Q + 2\triangle_Q}},$$

where $\square_{qQ}$ is the sum of all correlations between members of one
group with the other group, $\triangle_q$ is the sum of all correlations
between members of the first group, $\triangle_Q$ is a similar sum between
members of the second group, $q$ is the number of species in group
one and $Q$ the number of species in group two. Thus in this particular
case $\square_{qQ}$ equals (63 x 67) + (63 x 68) + (64 x 67) + (64 x 68)
= .692 + .682 + .681 + .672 = 2.727; $\triangle_q$ in this case equals only
(63 x 64) = .908 while $\triangle_Q$ equals 67 x 68 = .965, since each of these
groups consists of two species only. In cases where a group consists
of 3 species, for example, the $\triangle$ term consists of the sum of
(1 x 2) + (1 x 3) + (2 x 3). In the present case $q = Q = 2$ species.
Substituting into the formula given above:

$$(63)_1 \times (67)_1 = \frac{2.727}{\sqrt{2 + 2(.908)}\,\sqrt{2 + 2(.965)}} = .704$$
10. The notation $(63)_1$ refers to the group of species formed in matrix 1, the lowest
numbered member of which is species 63, i.e., to group 63-64. Similarly $(67)_1$ refers to
67-68, and $(77)_1$ to 77-82-83.
These computations can be set up in a systematic manner and are
then neither particularly complicated nor time consuming. In the
special case where we wish to calculate the correlation coefficient
between a single species (x) and a new group (q), the formula is
amended as follows:
$$r_{x \cdot q} = \frac{\Sigma r_{x \cdot q}}{\sqrt{q + 2\triangle_q}}$$

An illustration is the correlation of species 69 with group $(77)_1$. $\Sigma r_{x \cdot q}$
equals (69 x 77) + (69 x 82) + (69 x 83) = .764 + .788 + .738 = 2.290,
while $\triangle_q$ = (77 x 82) + (77 x 83) + (82 x 83) = .925 + .951 + .948 = 2.824.
Then

$$69 \times (77)_1 = \frac{2.290}{\sqrt{3 + 2(2.824)}} = .779$$
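
Both worked examples can be reproduced with a short sketch of the sum-of-variables formula. The function names and the dictionary layout are assumptions made for illustration; the coefficients are those quoted in the text. A single species is treated as a group of one, for which the within-group sum is zero and the general formula reduces to the special case.

```python
from itertools import combinations
from math import sqrt


def within_sum(corr, group):
    """Sum of correlations among the members of one group (the triangle term)."""
    return sum(corr[frozenset(p)] for p in combinations(group, 2))


def group_correlation(corr, first, second):
    """Correlation between two sums of variables (groups or single species)."""
    between = sum(corr[frozenset((a, b))] for a in first for b in second)
    denom_first = sqrt(len(first) + 2 * within_sum(corr, first))
    denom_second = sqrt(len(second) + 2 * within_sum(corr, second))
    return between / (denom_first * denom_second)


# Coefficients quoted in the worked examples.
corr = {
    frozenset((63, 64)): 0.908, frozenset((67, 68)): 0.965,
    frozenset((63, 67)): 0.692, frozenset((63, 68)): 0.682,
    frozenset((64, 67)): 0.681, frozenset((64, 68)): 0.672,
    frozenset((77, 82)): 0.925, frozenset((77, 83)): 0.951,
    frozenset((82, 83)): 0.948, frozenset((69, 77)): 0.764,
    frozenset((69, 82)): 0.788, frozenset((69, 83)): 0.738,
}

print(round(group_correlation(corr, [63, 64], [67, 68]), 3))   # 0.704
print(round(group_correlation(corr, [69], [77, 82, 83]), 3))   # 0.779
```

Only correlations involving a newly formed group need to pass through this function; coefficients between species that remained single carry over unchanged.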
In such a manner a new 17 x 17 correlation matrix (matrix 2) was
constituted. From this point on the species groups [$(63)_1$, $(65)_1$,
$(67)_1$, $(71)_1$, and $(77)_1$] were tested as though they were single
variables.
Once matrix 2 had been computed the identical grouping procedure
was followed. Group $(71)_1$ had a mutually highest correlation
with species 70 at .923. They were then joined by 72 and group
$(67)_1$ at $L_n$ levels of .903 and .885 respectively. These affiliations of
72 and $(67)_1$ were also their highest correlations. The next prospective
joiner was species 81 at $L_n$ = .859, i. e., not quite the established
drop of .030. However species 81 had highest relations not with the
previous species but with group $(77)_1$ with a correlation of .923.
Therefore it was excluded from consideration as a candidate for the
earlier group and the runner up, species 78, used instead. The latter
gave an $L_n$ value of .844, clearly a significant drop from .885. Species
81 meanwhile was used in a nucleus of a new group 77-81 [$(77)_2$].
Situations such as the above were the exception. In general the re-
lations and choices were entirely straightforward and could be left
to the discretion of the computing assistants.
At the end of each grouping procedure the remaining single
variables were checked to avoid missing groups with low correla-
tions between members. With each grouping procedure the matrix
of correlation coefficients became smaller and the job of recomputa-
tion less. The weighting procedure adopted by us was automatic
in that all correlation coefficients used were from the previous matrix
and not the initial one. It took eleven matrices to obtain a single
group out of the 23 species of the two subgenera. This amount of
work could have been reduced by raising the minimum recognized
difference in $L_n$ level above .030 but there would have been a resulting
loss of detail in the diagram of relationships. Conversely,
however, reducing the recognized difference below .030 would not
have increased the meaningful detail, since even .030 was too small
to prevent the occasional reversal of r values discussed below.
Once computed, the relations were represented as diagrams of
relationships as in figure 1. The ordinate at the left of each diagram
is graduated in units of 1000 x r. The correlations between any join-
ing stems in the diagram can be read by measuring the level along
the ordinate of the horizontal line connecting the stems. Thus
species 63 and 64 are correlated at a level of .908, while group 63-64
is related to group 67-72 at .702. Furcations involving more than
three lines are shown by broken lines converging on the midpoint
of the horizontal line as in group 88-96 of the above figure. The
tops of the figures are at a level of 1000 (correlation of 1) since
obviously each species is perfectly correlated with itself.
In cases of groups of only two stems the $L_n$ level corresponds to
the correlation coefficient of the two stems. When more than two
stems join to form a group the highest $L_n$ level was graphed for all
group members. Thus while 77 x 83 equals .951, $L_n$ for 82 against
77 and 83 equals .936. The group of these three species (77-83) is
shown related at level .951. Occasionally the correlation coefficients
for the same group in successive matrices will rise a little. Thus in
this same figure groups 63-64, 69-85, 87-96, and species 86 are shown
joining at .702. The first three groups actually joined at .671, but
species 86 which joined their group at the next matrix did so at
level .702. This type of situation, which occurred infrequently, might
lead one to express concern about the validity of the method, since
regular decreases in levels of correlation coefficients and $L_n$ values
are expected. However, it can be shown from Spearman's formula
for the correlation of sums of variables that slight increases in the
levels of correlation coefficients of the sums of variables above the
correlations of their component variables are possible. For example,
if A and B have formed the nucleus of a group at $r_{a \cdot b} = .9$
and C is about to join them, then by the rules of the variable group
method both $r_{a \cdot c}$ and $r_{b \cdot c}$ must be $< r_{a \cdot b} = .9$. It can then be shown
that $r_{(ab) \cdot c}$ must be $< .925$. Thus $r_{(ab) \cdot c}$, while it will usually be $< r_{a \cdot b}$,
could be slightly more than .9. Similar situations can be shown to
exist with larger-sized groups. The increases found by us were well
below the mathematically possible limits. In all such cases the
relations were represented as multifid furcations of all the stems
involved in the reversal and at the highest of the several $L_n$ levels
considered.
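
The limit quoted in this example can be checked from the formula for a single variable against a group of two. The following is a sketch of the arithmetic only, writing $r_{(ab) \cdot c}$ for the correlation of C with the sum of A and B and taking both $r_{a \cdot c}$ and $r_{b \cdot c}$ at their upper limit of .9:

```latex
r_{(ab) \cdot c} = \frac{r_{a \cdot c} + r_{b \cdot c}}{\sqrt{2 + 2 r_{a \cdot b}}}
  < \frac{.9 + .9}{\sqrt{2 + 2(.9)}} = \frac{1.8}{\sqrt{3.8}} \approx .923
```

This value lies above .9 and within the limit of .925 stated in the text, so a small rise above the level of the nucleus is arithmetically possible.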
In a successful method of studying relationships, the results of
the analysis should be relatively independent of the number of
species in the correlation matrix. If at least one species per species-
group is included in a matrix, the ideal method of analysis should
reproduce the diagram of relationships based on an earlier study of
a larger matrix. If the method can be shown to produce similar
results, the fact that our matrix contains only a sample from the
population of species can be ignored with greater assurance.
We tested this question by subjecting the odd-numbered species
in the entire genus Ashmeadiella to a weighted variable group anal-
ysis. This should give an adequate cross-section of relationships in
that genus. Since some trends might be lost by exclusion of the
even-numbered species, a special diagram of relationships was pre-
pared from figure 1 by using only odd-numbered species. Figure 5
shows this predicted diagram of relationships.
Fig. 5. Diagram of relationships predicted for odd-numbered species of the
genus Ashmeadiella on the basis of the relationship diagram of figure 1.
Explanatory comments as for figure 1.
Figure 6 shows the results of the weighted variable group analysis
on the odd-numbered species. There is less structure in this dia-
gram as compared with the predicted diagram of figure 5. This
was to be expected since some of the structure was based on rela-
[...] the missing [...] tween the two [...] In general, [...]
we therefore feel reassured that the species left out in our study
[...]
Fig. 6. Diagram of relationships obtained by an independent weighted
variable group analysis of the odd-numbered species of the genus
Ashmeadiella. Explanatory comments as for figure 1.
CONCLUDING COMMENTS
A detailed discussion of the comparisons of our findings with the
classifications of the four genera of bees is given by
Michener and Sokal (1957). It will suffice here to state that general
agreement was good but that a number of taxonomic re-evaluations
seemed necessary as an outcome of these analyses. It should be re-
membered that while diagrams such as figure 1 may suggest phylo-
genies, in reality they only indicate static relationships. As indi-
cated in the paper referred to, additional refinements were devised
to give diagrams of relationships which we believe more nearly ap-
proach phylogenetic trees.
In view of these results we are encouraged to believe that, since
the methods we have described are increasingly practical with the
growing availability of high speed computers, this or similar schemes
will be more widely utilized with different groups of organisms.
Although the method we have described is a first attempt and would
profit by either simplification or refinement, we believe it is a step
toward reducing the subjectivity of systematic work, and therefore
a step in the right direction.
LITERATURE CITED
Burt, C. L.
1937. Correlations between persons. Brit. Journ. Psychol., vol. 28, pp.
59-96.
Cattell, R. B.
1944. A note on correlation clusters and cluster search methods. Psy-
chometrika, vol. 9, pp. 169-184.
1952. Factor analysis. Harper & Brothers, New York, pp. xiii + 462.
1954. Growing points in factor analysis. Austral. Journ. Psychol., vol. 6,
pp. 105-140.
Holzinger, K. J., and H. H. Harman.
1941. Factor analysis. Univ. of Chicago Press, Chicago, pp. xii + 417.
Howells, W. W.
1951. Factors of human physique. Amer. Journ. Phys. Anthrop., n. s., vol.
9, pp. 159-191.
McQuitty, L. L.
1955. A method of pattern analysis for isolating topological and dimen-
sional constructs, Research Report AFPTRC-TN-55-62, Air Force
Personnel and Training Center, Lackland Air Force Base, San
Antonio, Texas, v + 38 pp.
Michener, C. D., and R. R. Sokal.
1957. A quantitative approach to a problem in classification. Evolution,
vol. 11, pp. 130-162.
Olson, E. C., and R. L. Miller.
1951. A mathematical model applied to a study of the evolution of species.
Evolution, vol. 5, pp. 325-338.
Sokal, R. R.
1952. Variation in a local population of Pemphigus. Evolution, vol. 6,
pp. 296-315.
1958. Quantification of systematic relationships and of phylogenetic
trends. Proc. Tenth Internat. Congress Entomology, Montreal,
(in press).
Spearman, C.
1913. Correlations of sums or differences. British Journ. Psychology,
vol. 26, pp. 344-361.
Stroud, C. P.
1953. An application of factor analysis to the systematics of Kalotermes.
Systematic Zool., vol. 2, pp. 76-92.
Sturtevant, A. H.
1942. The classification of the genus Drosophila with descriptions of nine
new species. Univ. Texas Publ., no. 4213, pp. 1-51.
Thomson, G. H.
1951. The factorial analysis of human ability. 5th ed. Houghton Mifflin
Company, New York, pp. xv + 383.
Tryon, R. C.
. Cluster analysis. Edwards Bros., Ann Arbor, Mich., pp. 1-122